Adaptive gradient descent (AdaGrad) is an optimizer that dynamically adapts a separate learning rate for each parameter, based on the history of the partial derivatives of the loss with respect to that parameter. As the parameters approach a saddle point, the effective learning rates of the slow-moving parameters remain comparatively large, helping them continue to make progress.

As seen below, AdaGrad accumulates squared (and therefore non-negative) gradient terms for each parameter. As a result, the effective learning rate shrinks monotonically over time and eventually approaches zero. RMSprop was developed to address this problem by replacing the running sum with an exponentially decaying average, and Adam has largely superseded both, as it also incorporates momentum.
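For intuition, consider a rough one-parameter illustration (an assumption for exposition: the update rule given below, a constant gradient magnitude $g$, and the small stability constant ignored). After $t$ iterations the accumulator holds $G_t = t\,g^2$, so the step magnitude is

$$\frac{\eta}{\sqrt{G_t}}\,|g| = \frac{\eta\,|g|}{\sqrt{t\,g^2}} = \frac{\eta}{\sqrt{t}},$$

which decays toward zero as $t$ grows, regardless of how far the parameter still is from a minimum.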

To achieve adaptation, AdaGrad maintains a state vector $G_t$ consisting of the sum, over all iterations so far, of the squared partial derivatives of the loss with respect to each parameter in the model. The value for parameter $i$ is

$$G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^{2},$$

where $g_\tau = \nabla_\theta L(\theta_\tau)$ denotes the gradient of the loss at iteration $\tau$.

While the above is technically true, in practice the vector is updated iteratively as

$$G_t = G_{t-1} + g_t \odot g_t.$$

AdaGrad then uses this state vector to calculate the change in parameters $\Delta\theta_t$:

$$\Delta\theta_t = -\frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t,$$

where $\odot$ is the element-wise product, $\eta$ is the global learning rate, and $\epsilon$ is a small constant that prevents division by zero. The presence of the (constantly growing) state vector $G_t$ in the denominator of the left factor is the reason that AdaGrad tends to reduce the effective learning rate to near zero.
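The update fits in a few lines of code. Below is a minimal NumPy sketch of the rule above; the function name `adagrad_update`, the hyperparameter defaults, and the toy quadratic objective are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def adagrad_update(theta, grad, state, lr=0.1, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then take a scaled step."""
    state = state + grad * grad                        # G_t = G_{t-1} + g_t ⊙ g_t
    theta = theta - lr * grad / np.sqrt(state + eps)   # θ_t+1 = θ_t − η ⊙ g_t / √(G_t + ε)
    return theta, state

# Toy usage (illustrative): minimize f(θ) = ½‖θ‖², whose gradient is θ itself.
theta = np.array([1.0, -2.0])
state = np.zeros_like(theta)
for _ in range(100):
    grad = theta                                       # ∇f(θ) = θ for this toy objective
    theta, state = adagrad_update(theta, grad, state)
print(theta)                                           # both entries move toward the minimum at 0
```

Note that `state` only ever grows, which is exactly the behaviour described above: the per-parameter denominator increases monotonically, so later steps become smaller and smaller.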